data clustering
Data Clustering by Markovian Relaxation and the Information Bottleneck Method
We introduce a new, non-parametric and principled, distance based clustering method. This method combines a pairwise based ap(cid:173) proach with a vector-quantization method which provide a mean(cid:173) ingful interpretation to the resulting clusters. The idea is based on turning the distance matrix into a Markov process and then examine the decay of mutual-information during the relaxation of this process. The clusters emerge as quasi-stable structures dur(cid:173) ing this relaxation, and then are extracted using the information bottleneck method. The method can cluster data with no geometric or other bias and makes no assumption about the underlying distribution.
Size Regularized Cut for Data Clustering
We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut.
How to innovate my company using Artificial Intelligence?
Artificial Intelligence ( AI) has been one of the most recurrent topics of conversation and analysis in companies in recent years, especially among those who work on innovation within the company. Faced with an increasingly globalized world and in constant digital transformation, every day more professionals agree on the urgent need to incorporate and exploit disruptive technologies such as Artificial Intelligence to achieve the growth planned in the short, medium and long term. For those who are not familiar with the concept, we can say that Artificial Intelligence is a segment that belongs to the field of computer science. It is defined as the type of technology that allows systems and machines to simulate human intelligence . We refer to the ability to make decisions and also simulate actions that a human being would take.
Coarse-Refinement Dilemma: On Generalization Bounds for Data Clustering
Vaz, Yule, de Mello, Rodrigo Fernandes, Grossi, Carlos Henrique
This paper is organized as follows: Section 2 briefly introduces some studies related to the formalization of theoretical frameworks in the context of the Data Clustering (DC) problem; Section 3 introduces a general formulation for the DC and HC problems; Section 4 discusses the Coarse-Refinement Dilemma considering the homology group H 0; Section 5 shows that homology groups of degree greater than zero are affected by overrefined and over-coarsed topologies; Section 6 compares our proposed generalization bounds to Carlsson and M emoli [12]'s consistency; finally, conclusions and future directions are provided in Section 8. 2. Related work Data Clustering (DC) faces many challenges in defining and guaranteeing generalization from datasets, as it does not rely on labels and, consequently, it cannot take advantage of computing any evident error measurement such as risk [7]. While studying this issue, Kleinberg [8] considered that a clustering model is an application of a mapping f on top of a distance function d: I I R, given I contains indices of data points in some fixed-size set S, disregarding its ambient space though [25]. From this initial setup, Kleinberg [8] defined three properties to be respected in order to assess clustering algorithms and models: - Scale-invariance: Given a distance and a clustering function, d and f, and a scalar ฮฑ, the following must hold f (d) f (ฮฑd). Thus, the similarity representation over S must be consistent with the units of measurement; - Consistency: Let ฮ be a partition of S and d,d null two distance functions. Function d null is referred to as a ฮ transformation of d if: (i) for all i,j S belonging to the same cluster, d null (i,j) d( i,j); and (ii) for all i,j S belonging to different clusters, d null (i,j) d( i,j). Consistency holds if f (d null) f ( d) whenever d null is a ฮฃ transformation of d.
Structure Aware L1 Graph for Data Clustering
Han, Shuchu (Stony Brook Univsersity) | Qin, Hong (Stony Brook Univsersity)
In graph-oriented machine learning research, L1 graph is an efficient way to represent the connections of input data samples. Its construction algorithm is based on a numerical optimization motivated by Compressive Sensing theory. As a result, It is a nonparametric method which is highly demanded. However, the information of data such as geometry structure and density distribution are ignored. In this paper, we propose a Structure Aware (SA) L1 graph to improve the data clustering performance by capturing the manifold structure of input data. We use a local dictionary for each datum while calculating its sparse coefficients. SA-L1 graph not only preserves the locality of data but also captures the geometry structure of data. The experimental results show that our new algorithm has better clustering performance than L1 graph.
Data Clustering by Laplacian Regularized L1-Graph
Yang, Yingzhen (University of Illinois at Urbana-Champaign) | Wang, Zhangyang (University of Illinois at Urbana-Champaign) | Yang, Jianchao (Adobe Research) | Wang, Jiangping (University of Illinois at Urbana-Champaign) | Chang, Shiyu (University of Illinois at Urbana-Champaign) | Huang, Thomas S (University of Illinois at Urbana-Champaign)
L1-Graph has been proven to be effective in data clustering, which partitions the data space by using the sparse representation of the data as the similarity measure. However, the sparse representation is performed for each datum separately without taking into account the geometric structure of the data. Motivated by L1-Graph and manifold leaning, we propose Laplacian Regularized L1-Graph (LRโ1-Graph) for data clustering. The sparse representations of LRโ1-Graph are regularized by the geometric information of the data so that they vary smoothly along the geodesics of the data manifold by the graph Laplacian according to the manifold assumption. Moreover, we propose an iterative regularization scheme, where the sparse representation obtained from the previous iteration is used to build the graph Laplacian for the current iteration of regularization. The experimental results on real data sets demonstrate the superiority of our algorithm compared to L1-Graph and other competing clustering methods.
Active Data Clustering
Hofmann, Thomas, Buhmann, Joachim M.
Active data clustering is a novel technique for clustering of proximity data which utilizes principles from sequential experiment design in order to interleave data generation and data analysis. The proposed active data sampling strategy is based on the expected value of information, a concept rooting in statistical decision theory. This is considered to be an important step towards the analysis of largescale data sets, because it offers a way to overcome the inherent data sparseness of proximity data.
Active Data Clustering
Hofmann, Thomas, Buhmann, Joachim M.
Active data clustering is a novel technique for clustering of proximity data which utilizes principles from sequential experiment design in order to interleave data generation and data analysis. The proposed active data sampling strategy is based on the expected value of information, a concept rooting in statistical decision theory. This is considered to be an important step towards the analysis of largescale data sets, because it offers a way to overcome the inherent data sparseness of proximity data.
Active Data Clustering
Hofmann, Thomas, Buhmann, Joachim M.
Active data clustering is a novel technique for clustering of proximity datawhich utilizes principles from sequential experiment design in order to interleave data generation and data analysis. The proposed activedata sampling strategy is based on the expected value of information, a concept rooting in statistical decision theory. This is considered to be an important step towards the analysis of largescale datasets, because it offers a way to overcome the inherent data sparseness of proximity data.
Multidimensional Scaling and Data Clustering
Hofmann, Thomas, Buhmann, Joachim
Visualizing and structuring pairwise dissimilarity data are difficult combinatorial optimization problems known as multidimensional scaling or pairwise data clustering. Algorithms for embedding dissimilarity data set in a Euclidian space, for clustering these data and for actively selecting data to support the clustering process are discussed in the maximum entropy framework. Active data selection provides a strategy to discover structure in a data set efficiently with partially unknown data. 1 Introduction Grouping experimental data into compact clusters arises as a data analysis problem in psychology, linguistics, genetics and other experimental sciences. The data which are supposed to be clustered are either given by an explicit coordinate representation (central clustering) or, in the non-metric case, they are characterized by dissimilarity values for pairs of data points (pairwise clustering). In this paper we study algorithms (i) for embedding non-metric data in a D-dimensional Euclidian space, (ii) for simultaneous clustering and embedding of non-metric data, and (iii) for active data selection to determine a particular cluster structure with minimal number of data queries. All algorithms are derived from the maximum entropy principle (Hertz et al., 1991) which guarantees robust statistics (Tikochinsky et al., 1984).